Abstract. Crowd monitoring and analysis in mass events are highly
important technologies to support the security of attending persons.
Proposed methods based on terrestrial or airborne image/video data often
fail in achieving sufficiently accurate results to guarantee a robust service.
We present a novel framework for estimating human count, density and
motion from video data based on custom tailored object detection
techniques, a regression based density estimate and a total variation based
optical flow extraction. From the gathered features we present a detailed
accuracy analysis versus ground truth measurements. In addition, all
information is projected into world coordinates to enable a direct
integration with existing geo-information systems. The resulting human counts
demonstrate a mean error of 4% to 9% and thus represent a most efficient
measure that can be robustly applied in security critical services.

1 Introduction

The recognition of critical situations in crowded scenes is very important to
prevent escalations and human casualties. At large scale events, like music
festivals or sport events, important parameters for estimating the riskiness of
a situation are the number of persons, the density of individuals per square
meter, the general motion direction of groups of people and motion patterns
(like dangerous forward and backward motions in front of a stage or an
entrance). These parameters can be used to estimate the human pressure which
indicates potential locations of violent crowd dynamics [5]. Despite the huge
number of security forces and crowd control efforts, hundreds of lives are lost in
crowd disasters each year (like at Roskilde Festival in 2000, in Mina/Makkah
during the Hajj in 2006, or in Duisburg at Love Parade in 2010). In the future,
the presented framework is intended to provide sufficiently robust cues to help
prevent such disastrous incidents.

In this paper we introduce a setup based on HD video data which can be captured
either from a tower-mounted camera or from an airborne vehicle (airplane,
helicopter, UAV). The resulting video, capturing parts of the crowded scene, is
analyzed with computer vision techniques which extract the target parameters
(count, density, motion). To be able to pipe such information into a crowd
simulation framework, the per-pixel information has to be geo-referenced into a
world-coordinate system. This enables measurements in physical units, e.g.
number of persons per square meter and motion in meters per second. A crucial
parameter to detect critical situations in human crowds is the human pressure
$P$, defined by $P(x,t) = \rho(x,t)\,\mathrm{Var}(V(x,t))$, where $x$ is the
spatial location, $t$ the time, $\rho$ the estimated density and $V$ the motion
[5], which can be estimated by employing the proposed framework. Such
information can then be used to alert security staff, who then trigger
appropriate actions, like opening or closing a gate or restricting the access
of following people.
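
As a rough illustration of the pressure definition above, the following sketch evaluates $P = \rho \cdot \mathrm{Var}(V)$ on a regular grid. The array layout, the function name `crowd_pressure`, and the choice to take the variance over a short temporal window at each grid cell are assumptions made for this example; the text only gives the formula and does not fix the averaging window.

```python
import numpy as np

def crowd_pressure(density, velocity):
    """Pressure P(x) = rho(x) * Var(V(x)) per grid cell (illustrative sketch).

    density:  array (H, W), persons per square meter.
    velocity: array (T, H, W, 2), velocity vectors in m/s over T time steps.
    The variance is taken over the T time steps (an assumption of this sketch).
    """
    mean_v = velocity.mean(axis=0)                          # (H, W, 2)
    deviation = velocity - mean_v                           # (T, H, W, 2)
    variance = (deviation ** 2).sum(axis=-1).mean(axis=0)   # (H, W)
    return density * variance

# Dense crowd (4 persons per square meter) swaying back and forth at 0.5 m/s.
rho = np.full((2, 2), 4.0)
sway = np.zeros((6, 2, 2, 2))
sway[::2, ..., 0] = 0.5
sway[1::2, ..., 0] = -0.5
pressure = crowd_pressure(rho, sway)
```

A steadily moving crowd yields zero pressure under this definition, whereas oscillating motion at the same density does not, which matches the motivation of flagging forward and backward motions.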

Our contribution. The main difference between our approach and the related work
is that we apply higher order features for density estimation and provide an
accurate performance analysis in a geo-referenced framework. Specifically, we
use an object detector tailored for person detection, learn the density
estimate from image features w.r.t. a given ground truth (which can be seen as
an automatic feature selection) and rectify all information from 2D image
geometry to 3D world coordinates. In addition, the proposed framework is
general and could be combined with any existing visual features, with any
object category and with any object detection method. For example, it could be
applied, appropriate features presumed, to count trees or cars in airborne
videos.

2 State of the art 

Some principles for crowd monitoring and person counting have been published.
For example, [2] count people in an outdoor scenario based on a fixed mounted
static video camera using a motion segmentation followed by a feature
extraction that serves as input for a Gaussian regression model. The main
drawback w.r.t. our application is the prior motion segmentation. Such a system
can only identify moving people, therefore standing people are not counted. In
addition, other moving objects like cars or pets will also appear in the motion
segmentation. Authors of [T] detect individual people and crowd outlines from
airborne nadir looking images. While isolated persons are detected using a
custom tailored object detector, regions containing crowds are recognized when
many local features (features from accelerated segment test (FAST)) jointly
occur. The work does not contain an accuracy analysis and lacks a concept of
how to map potential crowd regions to estimated person counts. It also seems
problematic to define regions of crowds by low-level features, as in an
arbitrary scenario objects other than people will also give a high FAST
response (e.g. textured vegetated areas). The work of [TT] also deals with
airborne nadir looking images. This very interesting approach is similar to our
methodology in that it extracts local features (in this case again FAST) and
uses them to estimate the crowd density. The authors also include a feature
selection step to reject local features which potentially do not correspond to
persons. The density itself is extracted using a kernel density estimate based
on the feature occurrence. The number of individuals is spatially aggregated,
also using the FAST responses.

In the following we discuss related work in particular for object counting,
density estimation, motion estimation and geo-referencing.
Object Counting and Density Estimation. There are three main methodologies:
(1) Counting by detection: The idea is to detect each individual object
instance in the image and count them (this is how humans count). However, in
computer vision object detection is far from being solved [3] and detection is
a harder problem than counting alone. Huge problems arise when objects overlap
and occlude each other. (2) Counting by regression: These methods try to find a
mapping from various image features to the number of objects using supervised
machine learning methods. However, these methods do not use the location of the
objects in the image; instead they just regress to a single number, i.e. the
number of objects. Therefore, huge training datasets are necessary to achieve
useful results [B]. (3) Counting by density estimation: The main concept is to
estimate an object density function whose integral over any image region gives
the count of objects within this region [5]. For learning, the proposed methods
employ the ground truth location of objects and the learning can be posed as a
convex linear or quadratic program. An additional benefit of the method is that
after learning the density function can be estimated by simple multiplication
of the individual features with learned weights and is therefore very efficient.
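
To make the "integral equals count" property of option (3) concrete, the sketch below builds a density map from dot annotations and sums it over a region. Placing one unit-mass Gaussian per annotated object is a common construction that is assumed here for illustration; the kernel width `sigma` and both function names are hypothetical.

```python
import numpy as np

def dots_to_density(points, shape, sigma=2.0):
    """Density map with one unit-mass Gaussian per annotated object.

    points: iterable of (x, y) pixel positions; shape: (height, width).
    """
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    density = np.zeros(shape, dtype=float)
    for px, py in points:
        blob = np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2.0 * sigma ** 2))
        density += blob / blob.sum()   # each object contributes exactly 1
    return density

def count_in_region(density, x0, y0, x1, y1):
    """Sum (discrete integral) of the density over the window [x0, x1) x [y0, y1)."""
    return float(density[y0:y1, x0:x1].sum())

annotations = [(10, 10), (12, 11), (40, 30)]
density_map = dots_to_density(annotations, shape=(50, 60))
total = density_map.sum()                              # whole-image count
left_part = count_in_region(density_map, 0, 0, 25, 25) # count in a sub-window
```

The sum over the whole map returns the number of annotations, and the sum over the sub-window returns the number of objects whose mass lies inside it.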
Motion Estimation. Estimating small motions from adjacent video frames is
considered to be solved, or to state it differently, the accuracy of
state-of-the-art algorithms is sufficient for our needs. The so-called optical
flow can be extracted by total variation methods in image geometry, e.g. [15].
Geo-Referencing. Geo-referencing, also called ortho-rectification, is a
standard method in photogrammetry and in remote sensing (cf. e.g. [7]) which
projects the image onto the earth's surface in a given map projection. To be
able to handle the distortions due to the topography a digital surface model
(DSM) is used (global digital surface models like SRTM or ASTER GDEM
are freely available). If the terrain is rather flat the DSM can be replaced by the
knowledge of the mean terrain height. For areas containing many obstacles like
stages, bridges, etc. a laser scanner model will deliver the most accurate results.

3 Methods 

3.1 Workflow 

The proposed approach is sketched in Figure 1 and in Figure 2. The main idea
is to extract image features which are related to the human density by machine
learning techniques. We employ discretized features where the learning provides
a weight for each feature number. Thus, after learning, the density function can
be calculated by simple multiplications. In addition, the density estimate is a
real density function, meaning that the integral over the density yields the object
count (therefore, the integral over a subregion holds the number of objects in
this particular region). The motion between video frames is extracted using a
variational method. All gathered information is then geo-referenced and can
therefore be visualized and processed in any geographic information system.
Figure 2 shows a video frame superimposed with the estimated density and
motion and the same information geo-referenced and overlaid in Google Earth.



3.2 Object Counting and Density Estimation 

For object counting and density estimation we employ the method by [8]. This
method takes dense discretized feature maps extracted from the input images
and learns the density estimate via a regression to a ground truth density.
Thus, each pixel has to be described by a feature vector of the form
$f = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ which is 1 at the dimension of the
corresponding discretized feature and 0 otherwise. Since we want to detect
persons we apply the object detector of |1] with the learned model for persons
of the VOC 2009 challenge [3]. This detector yields confidence values which
have to be discretized.
As we know from experience and previous tests that very small and very high
Figure 1: Our proposed workflow for human density estimation: An image
with annotated humans (yellow dots), discretized features (in this specific case
the results of an object detector), the learned weights for each feature and the
estimated human density function (estimated count equals 250) are shown.




Figure 2: Geo-referencing of a given image, the human density and motion
estimate for test site Lakeside: (left) input image with superimposed color coded
human density function, motion, and estimated number of individuals and (right)
the geo-referenced version of (left) shown as Google Earth overlay.



confidences are useless for object counting, we set the minimal value to -4.0
and the maximal to -0.6 for all tests. High confidences usually only occur on
isolated non-occluded persons, i.e. not in crowds. If we did not saturate
the confidences, the density estimation would put too much emphasis on such
objects. These bounds are used to scale the confidences to
$[0, 255] \subset \mathbb{N}$. Now, each of the possible 256 values defines a
feature vector, as discussed above, which is 1 at the position of the
confidence value. Therefore, it yields 256 individual features (cf. Figure 1).
In addition, we extract dense scale-invariant feature transform (SIFT)
descriptors [9] using the implementation in [13] for each pixel. To be able to
discretize this information we take 256 SIFT prototypes [8] and the closest
prototype for each descriptor defines the quantized SIFT number. Therefore, for
each pixel we get a discretized SIFT value in $[0, 255] \subset \mathbb{N}$.
These additional 256 features are employed to test if simpler cues than object
detector confidences could yield useful results. For evaluation we train the
density estimation framework for each feature class individually and for both
combined, which is done by stacking the features.
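
The two discretization steps described above can be sketched as follows. The saturation bounds -4.0 and -0.6 come from the text; rounding to the nearest integer, the Euclidean nearest-prototype rule, and all function names are assumptions of this example, and the prototypes here are placeholders rather than real SIFT cluster centers.

```python
import numpy as np

CONF_MIN, CONF_MAX = -4.0, -0.6   # saturation bounds stated in the text

def quantize_confidence(conf):
    """Clip detector confidences to the bounds and scale to integers 0..255."""
    clipped = np.clip(conf, CONF_MIN, CONF_MAX)
    scaled = (clipped - CONF_MIN) / (CONF_MAX - CONF_MIN) * 255.0
    return np.rint(scaled).astype(np.int64)   # rounding rule is an assumption

def quantize_descriptors(desc, prototypes):
    """Index of the nearest prototype (Euclidean distance) for each descriptor row."""
    d2 = ((desc[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def one_hot(index, n_bins=256):
    """Feature vector that is 1 at the discretized feature and 0 elsewhere."""
    f = np.zeros(n_bins)
    f[index] = 1.0
    return f

bins = quantize_confidence(np.array([-10.0, -4.0, -0.6, 3.0]))
labels = quantize_descriptors(np.array([[1.0, 1.0], [9.0, 8.0]]),
                              np.array([[0.0, 0.0], [10.0, 10.0]]))
```

Stacking both feature classes then amounts to offsetting one index range, for instance using detector bins 0..255 and SIFT bins 256..511 in a 512-dimensional one-hot vector.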

The training itself minimizes the regularized Maximum Excess over Sub Arrays
(MESA) distance where we use the L1 and the Tikhonov regularization
[12] to solve the linear or quadratic equation system (i.e.
$\min_x \|Ax - b\|$ or $\min_x \|Ax - b\| + \|(x^T \Gamma x)/2\|$ with
$x \geq 0$ and the Tikhonov matrix $\Gamma$ being the identity matrix in our
case). All details of this methodology are given in [S].
The result is a weight for each of the discretized features and the resulting
density is calculated by multiplying the according weight with the extracted feature
value. Thus, for each pixel the density function is given and the sum over all
pixels represents the number of objects in the image, i.e. our person count.
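
The following sketch does not implement the MESA distance. As a deliberately simplified stand-in, it fits non-negative weights by Tikhonov-regularized least squares on whole-image counts, using projected gradient descent. The matrix layout (one row of feature histogram counts per training image), the squared residual, the regularization strength `lam` and the solver are all assumptions chosen to illustrate the role of $\Gamma = I$ and the constraint $x \geq 0$.

```python
import numpy as np

def fit_weights(A, b, lam=1e-3, iters=5000):
    """Minimize 0.5*||A x - b||^2 + 0.5*lam*||x||^2 subject to x >= 0.

    A: (n_images, n_features) number of pixels per discretized feature and image.
    b: (n_images,) ground-truth object counts.
    Simplified whole-image regression, not the MESA objective of the paper.
    """
    n = A.shape[1]
    H = A.T @ A + lam * np.eye(n)              # Hessian of the quadratic objective
    g0 = A.T @ b
    step = 1.0 / np.linalg.eigvalsh(H).max()   # safe step size 1/L
    x = np.zeros(n)
    for _ in range(iters):
        x = np.maximum(x - step * (H @ x - g0), 0.0)   # gradient step, then project
    return x

rng = np.random.default_rng(0)
A_demo = rng.integers(0, 50, size=(40, 8)).astype(float)
w_true = np.array([0.0, 0.01, 0.02, 0.0, 0.05, 0.1, 0.0, 0.03])
b_demo = A_demo @ w_true
w_fit = fit_weights(A_demo, b_demo)
```

With the identity as Tikhonov matrix, the penalty spreads weight over correlated features instead of zeroing most of them, which is the qualitative behaviour the results section attributes to this regularization.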

Therefore, in the testing phase the discretized features are extracted for
each image and multiplied by the learned weight vector, directly resulting in the
density estimation per pixel and the corresponding person count. It should be noted
that this approach introduces virtually no overhead over feature extraction [5].
In case of very efficient feature extraction methods, like decision trees and forests
[TD] or cascades of boosted weak classifiers [H], the whole density estimation
would also run in real time.
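
Because each pixel carries a one-hot feature vector, the multiplication with the weight vector reduces to a table lookup, which is why the test phase adds almost no cost. A minimal sketch, with hypothetical function names and toy weights:

```python
import numpy as np

def density_from_features(feature_map, weights):
    """Per-pixel density from a map of discretized feature indices.

    feature_map: (H, W) integer array with values in 0..len(weights)-1.
    weights:     (n_features,) learned weight per discretized feature.
    """
    return weights[feature_map]   # one-hot dot product equals a table lookup

def person_count(density):
    """The person count is the sum of the density over all pixels."""
    return float(density.sum())

toy_weights = np.zeros(256)
toy_weights[200] = 0.5            # toy value, not a learned weight
toy_features = np.zeros((4, 4), dtype=np.int64)
toy_features[1, 1] = 200
toy_features[2, 3] = 200
toy_count = person_count(density_from_features(toy_features, toy_weights))
```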

3.3 Motion Estimation 

The motion is estimated based on the optical flow in image geometry [15], using
an existing implementation. To get a more robust estimate the flow is
not gathered from two adjacent video frames but from frames with a temporal
distance of 10 frames. In addition, a given number of those flows are temporally
averaged to ensure smooth motion vectors.
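
The temporal averaging step can be sketched independently of the flow algorithm. The example assumes the flows between frame pairs (k, k + 10) have already been computed by some optical flow routine and stored in one array; the window length of 5 and the function name are placeholders, since the text only speaks of "a given number" of flows.

```python
import numpy as np

def smoothed_flow(flows, window=5):
    """Running mean over `window` consecutive flow fields.

    flows: (N, H, W, 2) array, flows[k] being the flow between frames k and k + gap.
    Returns an array of shape (N - window + 1, H, W, 2).
    """
    csum = np.cumsum(flows, axis=0)
    csum = np.concatenate([np.zeros_like(flows[:1]), csum], axis=0)
    return (csum[window:] - csum[:-window]) / window

# Seven toy flow fields whose value equals their index.
toy_flows = np.arange(7, dtype=float)[:, None, None, None] * np.ones((1, 3, 4, 2))
averaged = smoothed_flow(toy_flows, window=5)
```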

3.4 Geo-Referencing 

To keep it simple we define a common map frame for each of our test sites in
WGS84 UTM 33 North projection (EPSG 32633) since our sites are located
in eastern Austria, Europe. Then, for each image and for each column/line
coordinate the according 3D world coordinate is calculated; these coordinates
are used to rectify the density and motion information.
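
One way to obtain a world coordinate for a column/line position is to intersect the pixel's viewing ray with the ground. The sketch below assumes a pinhole camera with known intrinsics `K`, world-to-camera rotation `R` and camera center `C`, and a flat ground plane at a known mean height (the simplification mentioned in Section 2). These inputs and the function name are assumptions; a DSM-based intersection would replace the plane in hilly terrain.

```python
import numpy as np

def pixel_to_world(col, line, K, R, C, ground_z=0.0):
    """Intersect the viewing ray of pixel (col, line) with the plane Z = ground_z.

    K: (3, 3) camera intrinsics; R: (3, 3) world-to-camera rotation;
    C: (3,) camera center in map coordinates (easting, northing, height).
    """
    ray_cam = np.linalg.inv(K) @ np.array([col, line, 1.0])
    ray_world = R.T @ ray_cam                  # ray direction in map coordinates
    s = (ground_z - C[2]) / ray_world[2]       # scale at which the ray hits the plane
    return C + s * ray_world

# Toy nadir camera 30 m above the ground; coordinates are illustrative UTM-like values.
K_demo = np.array([[1000.0, 0.0, 720.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
R_demo = np.diag([1.0, -1.0, -1.0])
C_demo = np.array([500000.0, 5200000.0, 30.0])
center_hit = pixel_to_world(720, 540, K_demo, R_demo, C_demo)
offset_hit = pixel_to_world(820, 540, K_demo, R_demo, C_demo)
```

In this toy setup, 100 pixels of image offset correspond to 3 m on the ground, i.e. a ground sampling distance of 3 cm per pixel.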

Density. For geo-referencing the density we project each density pixel into the 
common frame. If a pixel gets hit more than once the values are summed up. 
This ensures that the sum of the density, i.e. the human count, stays the same 
in image and world coordinates. Since it happens that some pixels are hit more 
often than their neighbors due to aliasing, the whole geo-referenced density is 
smoothed using a Gaussian kernel. 
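
The count-preserving projection can be sketched as a scatter-add into a map grid. The grid definition (`origin`, `cell_size`, `grid_shape`) and the function name are assumptions of this example, and the final Gaussian smoothing is omitted for brevity.

```python
import numpy as np

def splat_density(density, world_xy, origin, cell_size, grid_shape):
    """Accumulate image-space density values into a geo-referenced grid.

    density:    (H, W) per-pixel density.
    world_xy:   (H, W, 2) map coordinates (x, y) of every pixel.
    origin:     (x_min, y_min) of the grid; cell_size in meters.
    grid_shape: (rows, cols) of the output grid.
    """
    cols = np.floor((world_xy[..., 0] - origin[0]) / cell_size).astype(int)
    rows = np.floor((world_xy[..., 1] - origin[1]) / cell_size).astype(int)
    inside = (rows >= 0) & (rows < grid_shape[0]) & (cols >= 0) & (cols < grid_shape[1])
    grid = np.zeros(grid_shape)
    # np.add.at sums values that fall into the same cell instead of overwriting them.
    np.add.at(grid, (rows[inside], cols[inside]), density[inside])
    return grid

toy_density = np.full((4, 4), 0.25)                     # sums to 4 persons
xx, yy = np.meshgrid(np.arange(4) * 0.3, np.arange(4) * 0.3)
toy_world = np.stack([xx, yy], axis=-1)
toy_grid = splat_density(toy_density, toy_world, origin=(0.0, 0.0),
                         cell_size=0.5, grid_shape=(3, 3))
```

A plain fancy-indexed assignment such as `grid[rows, cols] += density` would silently drop repeated hits, so the unbuffered `np.add.at` is what keeps the total count identical in image and world coordinates.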

Motion. In image geometry we cannot differentiate between object motion and
camera motion. Therefore, we transform the reference 2D image coordinate and
the according search 2D image coordinate (i.e. gathered via optical flow) into
3D world coordinates. These two world coordinates define the real object motion
vector independent of the camera movement. Since the temporal difference of
the two input video frames is known, the speed of the motion can be calculated in
meters per second.
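
A small sketch of this step: both image positions are mapped to world coordinates with the geometry of their own frame, and the displacement is divided by the elapsed time. The two mapping callables are placeholders for whatever geo-referencing is available per frame; the function name and the toy scale of 5 cm per pixel are assumptions.

```python
import numpy as np

def world_speed(p_ref, p_search, to_world_ref, to_world_search, frame_gap, frame_rate):
    """Velocity vector and speed (m/s) of one point, measured in world coordinates.

    p_ref, p_search: (col, line) in the reference frame and in the search frame.
    to_world_ref, to_world_search: callables mapping (col, line) to map coordinates,
    one per frame, so that camera motion between the frames is accounted for.
    """
    start = np.asarray(to_world_ref(*p_ref), dtype=float)
    end = np.asarray(to_world_search(*p_search), dtype=float)
    dt = frame_gap / frame_rate            # seconds between the two frames
    velocity = (end - start) / dt
    return velocity, float(np.linalg.norm(velocity))

# Static toy camera with 5 cm ground resolution per pixel.
flat = lambda col, line: np.array([col * 0.05, line * 0.05])
vel, speed = world_speed((100, 100), (108, 106), flat, flat, frame_gap=10, frame_rate=25)
```

Here a shift of (8, 6) pixels over 10 frames at 25 frames per second corresponds to 0.5 m in 0.4 s.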
4 Results



4.1 Test Data 



For evaluation of the presented concept, videos from two different scenarios were
acquired in HD quality. The first one, referred to as Lakeside, originates from a
music festival in Styria, Austria (cf. Figure 2). The video camera was mounted
on a tower (approximately 30 meters above ground). The camera was therefore
more or less static with small jiggling due to wind. To geo-reference the scene
only one image was manually rectified and defines the geometry for all other
images. The second scenario, called Donauinsel, originates from a huge open air
festival in Vienna, Austria (cf. Figure 3). Here the video camera was mounted
on an airplane. For geo-referencing, the meta-data (GPS/IMU) supplied by the
camera system was taken for each frame. Since every frame has different
exterior parameters, it was necessary to geo-reference every frame independently.
Table 1 lists the details of the video setups and parameters. We also
manually labeled many frames to get the ground truth person counts in training
and later in the testing phase (overall over 23500 persons were annotated with
a mean height of 90 pixels, cf. Table 2). It is important to note that the
scenes for learning are similar to, but different from, the testing scenes. Since
the Lakeside scenario contains a much larger data set, most of the experimental
results are focused on this set. The Donauinsel scenario contains insufficient
images for sustainable training and testing. In addition, only the density
estimate is evaluated in detail since the motion estimation can be solved by
state-of-the-art algorithms.





             Image size    Frame   Number of   Length     Camera parameters
             in pixels     rate    frames      in m:ss
Lakeside     1440 x 1080   25      6801        4:32       Canon HV30 camera, fixed mounted on a tower
Donauinsel   1280 x 720    50      721         0:14       FLIR Star Safire HD camera, mounted on DA42 MPP airplane

Table 1: Test video data sets for the two scenarios.



Lakeside     nr. of images   persons total   persons mean   persons std
Training     12              3154            263            7.3
Testing      68              18884           278            13.2

Donauinsel   nr. of images   persons total   persons mean   persons std
Training     5               672             134            41.7
Testing      6               848             141            35.8

Table 2: Manually labeled persons for the two scenarios.



4.2 Density Estimation 

Learning. The accuracy of the learning process is listed in Table 3. It can be
seen that the used object detector has a better impact on the density estimation
than the dense SIFT descriptors in 7 of 8 cases (the one exception stems from L1
based regularization, which is in general unstable). Using both features increases
the accuracy. It is also interesting that the two regularizations yield similar
results, even though the learned weights are very different. Overall, the L1
regularization tends towards a zero-solution, i.e. setting many weights to zero,
while the Tikhonov regularization populates the weights a lot smoother (this is
Figure 3: Geo-referencing of a given image for the test site Donauinsel: (left)
Airborne video frame and (right) the geo-referenced version of (left) overlaid
on a true ortho image with 4 cm GSD.

a property of the Tikhonov regularization, as it improves the condition of the
problem and enables a more stable numerical solution). This aspect seems not
important for the learning set; however, it changes the performance in the testing
phase. If, e.g., we have a slight motion blur in one of the images, the according
L1 weights drop to zero, while the Tikhonov weights do not. For the Donauinsel
scenario the Tikhonov based regularization yields a lower accuracy than L1 in
case of dense SIFTs. We assume that the low number of learning samples and
the unfavorable mapping of discretized SIFT values to the real occurrence of
persons (the stage rack contains many vertical structures, i.e. the same features
as a person) yield a bad condition of the equation system and therefore the
solution tends to a local minimum instead of the global one. While learning
based on L1 regularization picks a few SIFT keys and a few object detection
scores (only 10), the Tikhonov based learning takes more SIFTs and a logical
weight distribution of the object detector (in total 453), where logical means
that the learned weights are dependent on the object detector confidences.



Lakeside          training L1   training Tikhonov   testing L1     testing Tikhonov
object detector   4.7 (1.8%)    4.75 (1.8%)         13.3 (4.8%)    10.6 (3.8%)
dense SIFT        7.0 (2.7%)    6.7 (2.5%)          11.2 (4.0%)    11.1 (4.0%)
both              4.5 (1.7%)    4.4 (1.7%)          10.8 (3.9%)    10.0 (3.6%)

Donauinsel        training L1   training Tikhonov   testing L1     testing Tikhonov
object detector   7.1 (5.3%)    7.0 (5.2%)          12.7 (9.0%)    10.0 (7.1%)
dense SIFT        7.0 (5.2%)    10.3 (7.7%)         15.9 (11.3%)   18.0 (12.8%)
both              7.1 (5.3%)    5.6 (4.2%)          11.9 (8.4%)    12.1 (8.6%)

Table 3: Accuracy of density learning and testing. Given are the average errors
of the total human count and the percental error over the training and test
images, for two regularization options and different image features.
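
The two numbers reported per table cell could be computed as sketched below. Reading the percentage as the mean absolute count error divided by the mean ground-truth count is an assumption: it reproduces, for example, 4.7 / 263 = 1.8% from Tables 2 and 3, but the exact formula is not spelled out in the text. The function name is hypothetical.

```python
import numpy as np

def count_errors(estimated, ground_truth):
    """Mean absolute count error and its percentage of the mean ground-truth count."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    mae = np.abs(est - gt).mean()
    return float(mae), float(100.0 * mae / gt.mean())

# Toy values, not data from the paper.
toy_mae, toy_percent = count_errors([260, 270], [263, 263])
```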



Testing. The accuracy of the density estimation is given in Table 3. Like
in the training phase the Tikhonov regularization yields slightly higher
accuracies than the L1 one. On average the mean person counting error is 4% for
the Lakeside and 9% for the Donauinsel data set. Figure 4 shows the estimated
person count of Lakeside with superimposed manually measured ground truth.
Both resulting curves are similar; however, the Tikhonov regularization creates
a smoother result. Experimentally we can support this observation by taking a
look at the temporal smoothness of the estimated person count. The standard
deviation of per frame differences of the estimated count is 4.4 for L1
regularization and 3.8 for Tikhonov regularization (for Lakeside and when using
both feature sets). A lower number represents a more realistic setting, as
the number of persons in two adjacent frames should not vary much. When
taking a close look at Figure 4, a rather large error is visible towards the end of
the sequence (image number 6500 to 6700). The reason for this issue is strong
winds causing camera shaking and therefore motion blur in the images.
Consequently, the extracted features differ from those seen during learning,
resulting in a lower human density estimate.
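
The temporal smoothness measure used above is simple to state in code. The sketch assumes the population standard deviation (NumPy's default); whether the reported values use that or the sample variant is not stated. The function name is hypothetical.

```python
import numpy as np

def temporal_roughness(counts):
    """Standard deviation of frame-to-frame differences of the estimated count."""
    diffs = np.diff(np.asarray(counts, dtype=float))
    return float(np.std(diffs))

# Toy sequences, not data from the paper.
steady_sequence = temporal_roughness([250, 251, 252, 253])
jumpy_sequence = temporal_roughness([250, 252, 250, 252, 250])
```

A count that drifts at a constant rate scores zero, whereas a count that jumps up and down between frames scores high, which is the behaviour the measure is meant to penalize.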



Figure 4: Person counting: Estimated person count using L1 regularization
(blue) and Tikhonov regularization (green) for the Lakeside scenario, plotted
over the test image number. The red dots indicate the manually measured ground
truth for the test images.



5 Conclusion 

In this work we presented a method for people counting and crowd monitoring
from airborne imagery. The estimated parameters from a given video stream
were the human count as well as the human density and motion for each pixel.
This information was geo-referenced into a world coordinate system. Overall,
the estimated human counts were highly accurate with a resulting count error of
4% and 9% for the two presented scenarios, which could be reached by employing
a custom tailored object detector instead of simple image features, amongst
other implementation details. The proposed framework is therefore highly
important for security applications.

Outlook. Currently, the framework is optimized for oblique views and thus
it will not yield reasonable accuracies when e.g. employing nadir images. We
envision training the system on several viewing conditions, where the object
detector should also be custom tailored (like a detector for head and shoulders
for oblique views and a blob-like detector for nadir views). The viewing
condition itself can be derived from the airplane's geo-sensors. When extracting
the human densities the system is then able to choose from the learned models
according to the viewing parameters. Of course it would also be of interest to
test the effect of different features and detectors on the accuracy, as well as
various regularizations for minimizing the MESA distance in the machine
learning approach.